NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Federated Contrastive Learning of Graph-Level Representations

Li, Xiang; Agrawal, Gagan; Ramnath, Rajiv; Jin, Ruoming (December 2024, IEEE)

Full Text Available
TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations

https://doi.org/10.1145/3721145.3725774

Guan, Jiexiong; Hu, Zhenqing; Antonopoulos, Christos D; Bellas, Nikolaos; Lalis, Spyros; Smirni, Evgenia; Zhou, Gang; Agrawal, Gagan; Ren, Bin (June 2025, ACM)

The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile system-on-a-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a non-well-documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48 − 3.61× speedup on individual kernels and 1.83 − 66.1× speedup for end-to-end on-device training with only 0.25% − 18.5% the tuning cost of the baselines.
more » « less
Free, publicly-accessible full text available June 8, 2026
Scalable Deep Metric Learning on Attributed Graphs

Li, Xiang; Agrawal, Gagan; Jin, Ruoming; Ramnath, Rajiv (July 2024, LNCS)

We consider the problem of constructing embeddings of large attributed graphs and supporting multiple downstream learning tasks. We develop a graph embedding method, which is based on extending deep metric and unbiased contrastive learning techniques to 1) work with attributed graphs, 2) enabling a mini-batch based approach, and 3) achieving scalability. Based on a multi-class tuplet loss function, we present two algorithms -- DMT for semi-supervised learning and DMAT-i for the unsupervised case. Analyzing our methods, we provide a generalization bound for the downstream node classification task and for the first time relate tuplet loss to contrastive learning. Through extensive experiments, we show high scalability of representation construction, and in applying the method for three downstream tasks (node clustering, node classification, and link prediction) better consistency over any single existing method.
more » « less
Full Text Available
SoD2: Statically Optimizing Dynamic Neural Network Execution

Niu, Wei; Agrawal, Gagan; Ren, Bin (May 2024, ACM)

Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution are becoming common. This paper presents SoD2, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification towards a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains.
more » « less
Full Text Available
SoD ² : Statically Optimizing Dynamic Deep Neural Network Execution

https://doi.org/10.1145/3617232.3624869

Niu, Wei; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
Organizing Records for Retrieval in Multi-Dimensional Range Searchable Encryption

Heidaripour, Mahdieh; Kian, Ladan; Rezapour, Maryam; Holcomb, Mark; Fuller, Benjamin; Agrawal, Gagan; Maleki, Hoda (July 2024, International Conference on Security and Cryptography)

Storage of sensitive multi-dimensional arrays must be secure and efficient in storage and processing time. Searchable encryption allows one to trade between security and efficiency. Searchable encryption design focuses on building indexes, overlooking the crucial aspect of record retrieval. Gui et al. (PoPETS 2023) showed that understanding the security and efficiency of record retrieval is critical to understand the overall system. A common technique for improving security is partitioning data tuples into parts. When a tuple is requested, the entire relevant part is retrieved, hiding the tuple of interest. This work assesses tuple partitioning strategies in the dense data setting, considering parts that are random, 1-dimensional, and multi-dimensional. We consider synthetic datasets of 2,3 and 4 dimensions, with sizes extending up to 2M tuples. We compare security and efficiency across a variety of record retrieval methods. Our findings are: 1. For most configurations, multi-dimensional partitioning yields better efficiency and less leakage. 2. 1-dimensional partitioning outperforms multi-dimensional partitioning when the first (indexed) dimension is any size as long as the query is large in all other dimensions. 3. The leakage of 1-dimensional partitioning is reduced the most when using a bucketed ORAM (Demertiz et al., USENIX Security 2020).
more » « less
Full Text Available
SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

https://doi.org/10.1145/3620666.3651384

Niu, Wei; Sanim, Md_Musfiqur Rahman; Shu, Zhihao; Guan, Jiexiong; Shen, Xipeng; Yin, Miao; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
ForensiBlock: A Provenance-Driven Blockchain Framework for Data Forensics and Auditability

https://doi.org/10.1109/TPS-ISA58951.2023.00025

Akbarfam, Asma Jodeiri; Heidaripour, Mahdieh; Maleki, Hoda; Dorai, Gokila; Agrawal, Gagan (November 2023, IEEE)

Full Text Available
SecFob: A Remote Keyless Entry Security Solution

https://doi.org/10.1109/ISC257844.2023.10293400

Bolt, Braxton; Maleki, Hoda; Agrawal, Gagan; Morris, Jeffrey D.; Farabi, Khan (September 2023, IEEE)
End-to-End LU Factorization of Large Matrices on GPUs

https://doi.org/10.1145/3572848.3577486

Xia, Yang; Jiang, Peng; Agrawal, Gagan; Ramnath, Rajiv (February 2023, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming)

LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.
more » « less
Full Text Available

« Prev Next »

Search for: All records